Abstract

Reinforcement learning methods are increasingly used to
optimise dialogue policies from experience. Most current
techniques are model-free: they directly estimate the utility of
various actions, without an explicit model of the interaction dynamics.
In this paper, we investigate an alternative strategy grounded in 
model-based Bayesian reinforcement learning. Bayesian infer- 
ence is used to maintain a posterior distribution over the model 
parameters, reflecting the model uncertainty. This parameter 
distribution is gradually refined as more data is collected and 
simultaneously used to plan the agent's actions. 
Within this learning framework, we carried out experiments 
with two alternative formalisations of the transition model, one 
encoded with standard multinomial distributions, and one struc- 
tured with probabilistic rules. We demonstrate the potential of 
our approach with empirical results on a user simulator con- 
structed from Wizard-of-Oz data in a human-robot interaction 
scenario. The results illustrate in particular the benefits of cap- 
turing prior domain knowledge with high-level rules. 
Index Terms: dialogue management, reinforcement learning, 
Bayesian inference, probabilistic models, POMDPs 

1. Introduction 

Designing good control policies for spoken dialogue systems 
can be a daunting task, due both to the pervasiveness of speech 
recognition errors and the large number of dialogue trajecto- 
ries that need to be considered. In order to automate part of 
the development cycle and make it less prone to design errors, 
an increasing number of approaches have come to rely on
reinforcement learning (RL) techniques [1, 2, 3, 4, 5, 6, 7, 8] to
automatically optimise the dialogue policy. The key idea is to model
dialogue management as a Markov Decision Process (MDP) or 
a Partially Observable Markov Decision Process (POMDP), and 
let the system learn by itself the best action to perform in each 
possible conversational situation via repeated interactions with 
a (real or simulated) user. Empirical studies have shown that 
policies optimised via RL are generally more robust, flexible 
and adaptive than their hand-crafted counterparts [2, 9].

To date, most reinforcement learning approaches to policy
optimisation have adopted model-free methods such as
Monte Carlo estimation [8], Kalman Temporal Differences
[10], SARSA(λ) [6], or Natural Actor Critic [11]. In model-free
methods, the learner seeks to directly estimate the expected
return (Q-value) for every state-action pair based on the set of
interactions it has gathered. The optimal policy is then simply 
defined as the one that maximises this Q-value. 

In this paper, we explore an alternative approach, inspired 
by recent developments in the RL community: model-based 
Bayesian reinforcement learning [12, 13]. In this framework,
the learner doesn't directly estimate Q-values, but rather
gradually constructs an explicit model of the domain in the form of
transition, reward and observation models. Starting with some
initial priors, the learner iteratively refines the parameter es- 
timates using standard Bayesian inference given the observed 
data. These parameters are then subsequently used to plan the 
optimal action to perform, taking into consideration every pos- 
sible source of uncertainty (state uncertainty, stochastic action 
effects, and model uncertainty). 

In addition to providing an elegant, principled solution 
to the exploration-exploitation dilemma [12], model-based
Bayesian RL has the additional benefit of allowing the system 
designer to directly incorporate his/her prior knowledge into the 
domain models. This is especially relevant for dialogue man- 
agement, since many domains exhibit a rich internal structure 
with multiple tasks to perform, sophisticated user models, and a 
complex, dynamic context. We argue in particular that models 
encoded via probabilistic rules can boost learning performance 
compared to unstructured distributions. 

The contributions of this paper are twofold. We first demon- 
strate how to apply model-based Bayesian RL to learn the tran- 
sition model of a dialogue domain. We also compare two mod- 
elling approaches in the context of a human-robot scenario 
where a Nao robot is instructed to move around and pick up ob- 
jects. The empirical results show that the use of structured rep- 
resentations enables the learning algorithm to converge faster 
and with better generalisation performance. 

The paper is structured as follows. §2 reviews the key
concepts of reinforcement learning. We then describe how
model-based Bayesian RL operates (§3) and detail two alternative
formalisations for the domain models (§4). We evaluate the
learning performance of the two models in §5. §6 compares our
approach with previous work, and §7 concludes.



2. Background 



2.1. POMDPs 



Drawing on previous work [14, 7, 8, 15, 16], we formalise
dialogue management as a Partially Observable Markov Decision
Process (POMDP) $(S, A, O, T, Z, R)$, where $S$ represents the
set of possible dialogue states $s$, $A$ the set of system actions $a_m$,
and $O$ the set of observations - here, the N-best lists that can be
generated by the speech recogniser. $T$ is the transition model
$P(s' \mid s, a_m)$ determining the probability of reaching state $s'$
after executing action $a_m$ in state $s$. $Z$ is the probability $P(o \mid s)$
of observing $o$ when the current (hidden) state is $s$. Finally,
$R(s, a_m)$ is the reward function, which defines the immediate
reward received after executing action $a_m$ in state $s$.

In POMDPs, the current state is not directly observable
by the agent, but is inferred from the observations. The agent's
knowledge at a given time is represented by the belief state $b$,
which is a probability distribution $P(s)$ over possible states.
After each system action $a_m$ and subsequent observation $o$, the
belief state $b$ is updated to incorporate the new information:

$$b'(s) = P(s' \mid b, a_m, o) = \alpha \, P(o \mid s') \sum_{s} P(s' \mid s, a_m) \, b(s) \qquad (1)$$

where $\alpha$ is a normalisation constant.
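
As an illustration of the belief update in Eq. (1), the following is a minimal sketch for a flat, discrete state space. The function name `belief_update`, the dictionary layout and the two-state toy domain are assumptions made for this example; they are not taken from the paper, whose belief state is a factored Bayesian Network.

```python
def belief_update(belief, action, observation, transition, obs_model):
    """Return the updated belief: alpha * P(o|s') * sum_s P(s'|s,a) * b(s)."""
    states = list(belief)
    unnormalised = {}
    for s_next in states:
        # Prediction step: probability of reaching s_next after the action.
        predicted = sum(transition[(s, action)][s_next] * belief[s] for s in states)
        # Correction step: weight by the likelihood of the observation.
        unnormalised[s_next] = obs_model[s_next][observation] * predicted
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    # Dividing by the total plays the role of the normalisation constant alpha.
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical two-state toy domain (illustrative values only).
transition = {
    ("wants_left", "ask_repeat"): {"wants_left": 0.9, "wants_right": 0.1},
    ("wants_right", "ask_repeat"): {"wants_left": 0.1, "wants_right": 0.9},
}
obs_model = {
    "wants_left": {"heard_left": 0.8, "heard_right": 0.2},
    "wants_right": {"heard_left": 0.3, "heard_right": 0.7},
}
prior = {"wants_left": 0.5, "wants_right": 0.5}
posterior = belief_update(prior, "ask_repeat", "heard_left", transition, obs_model)
```

In this toy setting, hearing "left" shifts the probability mass towards `wants_left`, while the result remains a proper distribution.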

In line with other approaches [7], we represent the belief
state as a Bayesian Network and factor the state $s$ into three
distinct variables $s = \langle a_u, i_u, c \rangle$, where $a_u$ is the last user
dialogue act, $i_u$ the current user intention, and $c$ the interaction
context. Assuming that the observation $o$ only depends on the
last user act $a_u$, and that $a_u$ depends on both the user intention
$i_u$ and the last system action $a_m$, Eq. (1) is rewritten as:

$$b'(a_u, i_u, c) = P(a'_u, i'_u, c' \mid b, a_m, o) \qquad (2)$$

$$= \alpha \, P(o \mid a'_u) \, P(a'_u \mid i'_u, a_m) \sum_{i_u} P(i'_u \mid i_u, a_m, c) \, b(i_u, c) \qquad (3)$$

$P(o \mid a'_u)$ is often defined as $P(a_u)$, the dialogue act
probability in the N-best list provided by the speech recognition and
semantic parsing modules. $P(a'_u \mid i'_u, a_m)$ is called the user action
model, while $P(i'_u \mid i_u, a_m, c)$ is the user goal model.

2.2. Decision-making with POMDPs 

The agent's objective is to find the action $a_m$ that maximises its
expected cumulative reward $Q$. Given a belief state-action
sequence $[b_0, a_0, b_1, a_1, \ldots, b_n, a_n]$ and a discount factor $\gamma$, the
expected cumulative reward is defined as:

$$Q([b_0, a_0, b_1, a_1, \ldots, b_n, a_n]) = \sum_{t=0}^{n} \gamma^t \, R(b_t, a_t) \qquad (4)$$

where $R(b, a) = \sum_{s \in S} R(s, a) \, b(s)$. Using the fixed point
of Bellman's equation [17], the expected return for the optimal
policy can be written in the following recursive form:

$$Q(b, a) = R(b, a) + \gamma \sum_{o \in O} P(o \mid b, a) \max_{a'} Q(b', a') \qquad (5)$$

where $b'$ is the updated dialogue state following the execution
of action $a$ and the observation of $o$, as in Eq. (1). For notational
convenience, we used $P(o \mid b, a) = \sum_{s \in S} P(o \mid s, a) \, b(s)$.
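
To make the definition of the expected cumulative reward concrete, here is a small sketch that evaluates Eq. (4) on a fixed sequence of belief-action pairs. The helper names `expected_reward` and `discounted_return`, and all reward values, are hypothetical and chosen only for illustration.

```python
def expected_reward(belief, action, reward):
    """R(b, a) = sum_s R(s, a) * b(s)."""
    return sum(reward[(s, action)] * p for s, p in belief.items())


def discounted_return(trajectory, reward, gamma):
    """Eq. (4): sum_t gamma^t * R(b_t, a_t) over a list of (belief, action) pairs."""
    return sum(
        (gamma ** t) * expected_reward(b, a, reward)
        for t, (b, a) in enumerate(trajectory)
    )


# Hypothetical rewards for a two-state toy domain (illustrative values only).
reward = {
    ("wants_left", "go_left"): 6.0,
    ("wants_right", "go_left"): -6.0,
    ("wants_left", "ask_repeat"): -1.0,
    ("wants_right", "ask_repeat"): -1.0,
}
trajectory = [
    ({"wants_left": 0.5, "wants_right": 0.5}, "ask_repeat"),
    ({"wants_left": 0.9, "wants_right": 0.1}, "go_left"),
]
total = discounted_return(trajectory, reward, gamma=0.9)
```

Here the first step costs 1, and the second step contributes 0.9 times an expected reward of 4.8, so the return is 3.32.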

If the transition, observation and reward models are known, 
it is possible to apply POMDP solution techniques to extract an 
optimal policy $\pi : b \to a$ mapping from a belief point to the
action yielding the maximum Q-value [18, 19, 20].

Unfortunately, for most dialogue domains, these models are 
not known in advance. It is therefore necessary to collect a
large number of interactions in order to estimate the optimal
action for each given (belief) state. This is typically done by
trial-and-error, exploring the effect of all possible actions and
gradually focussing the search on those yielding a high return
[21]. Due to the number of interactions that are necessary to
reach convergence, most approaches rely on user simulators for
the policy optimisation. These user simulators are often
bootstrapped from Wizard-of-Oz experiments in which the system
is remotely controlled by a human expert [22].

3. Approach 

Contrary to model-free methods that directly estimate the policy 
or Q-value of (belief) state-action pairs, model-based Bayesian 
reinforcement learning relies on explicit transition, reward and 
observation models. These models are gradually estimated from 
the data collected by the learning agent, and are simultane- 
ously used to plan the actions to execute. Model estimation 
and decision-making are therefore intertwined. 



3.1. Bayesian learning 

The estimation of the model parameters is done via Bayesian 
inference - that is, the learning algorithm maintains a posterior 
distribution over the parameters of the POMDP models, and 
updates these parameters given the evidence. 

We focus in this paper on the estimation of the transition
model $P(s' \mid s, a_m)$. It should however be noted that the same
approach can in principle be applied to estimate the
observation and reward models [12]. The transition model can be
described as a collection of multinomials (one for each possible
conditional assignment of $s$ and $a_m$). It is therefore convenient
to describe their parameters with Dirichlet distributions, which 
are the conjugate prior of multinomials. 

Fig. 1 illustrates this estimation process. The two parameters
$\theta_{i'_u \mid i_u, a_m, c}$ and $\theta_{a'_u \mid i'_u, a_m}$ respectively represent the
Dirichlet distributions for the user goal and user action models.
Once a new N-best list of user dialogue acts is received, these
parameters are updated using Bayes' rule,
i.e. $P(\theta \mid o) = \alpha \, P(o \mid \theta) \, P(\theta)$.

The operation is repeated for every observed user act.
To ensure the algorithm remains tractable, we assume
conditional independence between the parameters, and we
approximate the inference via importance sampling.
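
The conjugacy between Dirichlet priors and multinomials can be illustrated with a simplified sketch. It assumes that each user act is fully observed, in which case the posterior is obtained by incrementing the matching Dirichlet count. This is a deliberate simplification: with N-best lists the evidence is uncertain, which is why the approach described above resorts to importance sampling. The function names and the three dialogue acts are hypothetical.

```python
def dirichlet_update(alpha, observed_value):
    """Posterior Dirichlet counts after one fully observed outcome."""
    updated = dict(alpha)
    updated[observed_value] += 1.0
    return updated


def dirichlet_mean(alpha):
    """Expected multinomial parameters under Dirichlet(alpha)."""
    total = sum(alpha.values())
    return {value: count / total for value, count in alpha.items()}


# Hypothetical uniform prior over the next user act for one conditional assignment.
alpha = {"Confirm": 1.0, "Disconfirm": 1.0, "Other": 1.0}
for act in ["Disconfirm", "Disconfirm", "Other", "Disconfirm"]:
    alpha = dirichlet_update(alpha, act)
posterior_mean = dirichlet_mean(alpha)
```

After the four observations the counts are (1, 4, 2), so the posterior mean assigns 4/7 to `Disconfirm`.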

3.2. Online planning 

After updating its belief state and parameters, the agent must 
find the optimal action to execute, which is the one that max- 
imises its expected cumulative reward. This planning step is 
the computational bottleneck in Bayesian reinforcement learn- 
ing, since the agent needs to reason not only over all the current 
and future states, but also over all possible transition models 
(parametrised by the $\theta$ variables). The high dimensionality of
the task usually prevents the use of offline solution techniques.
But several approximate methods for online POMDP planning
have been developed [23]. In this work, we used a simple
forward planning algorithm coupled with importance sampling.

Figure 1: Bayesian parameter estimation of the transition model.

Algorithm 1: $Q(b, a, h)$

 1: $q \leftarrow \sum_{s} b(s) \, R(s, a)$
 2: if $h > 1$ then
 3:     $b' \leftarrow \sum_{s} P(s' \mid s, a) \, b(s)$
 4:     $v \leftarrow 0$
 5:     for observation $o \in O$ do
 6:         $b'' \leftarrow \alpha \, P(o \mid s) \, b'(s)$
 7:         Estimate $Q(b'', a', h-1)$ for all actions $a'$
 8:         $v \leftarrow v + P(o \mid b') \max_{a'} Q(b'', a', h-1)$
 9:     end for
10:     $q \leftarrow q + \gamma \, v$
11: end if
12: return $q$



Algorithm 1 shows the iterative calculation of the Q-value
for a belief state $b$, action $a$ and planning horizon $h$. The
algorithm starts by computing the immediate reward, and then
estimates the expected future reward after the execution of the
action. Line 5 loops on possible observations following the
action (for efficiency reasons, only a limited number of
high-probability observations are selected), and for each, the
belief state is updated and its maximum expected reward is
computed. The procedure stops when the planning horizon has been
reached, or the algorithm has run out of time. The planner then
simply selects the action $a^* = \arg\max_a Q(b, a)$.
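
The forward search of Algorithm 1 can be sketched as a short recursive function. This sketch assumes a fixed, fully specified model and enumerates all observations; it therefore omits the sampling over model parameters, the pruning of low-probability observations and the time limit mentioned above. The toy domain, including the absorbing `done` state, is hypothetical.

```python
def q_value(belief, action, horizon, model, gamma=0.95):
    """Forward-search estimate of Q(b, a, h), following the steps of Algorithm 1."""
    states, actions = model["states"], model["actions"]
    T, Z, R = model["T"], model["Z"], model["R"]
    # Line 1: immediate expected reward.
    q = sum(belief[s] * R[(s, action)] for s in states)
    if horizon > 1:
        # Line 3: predicted belief after executing the action.
        predicted = {
            s2: sum(T[(s, action)][s2] * belief[s] for s in states) for s2 in states
        }
        v = 0.0
        for o in model["observations"]:
            # P(o | b'): probability of the observation under the predicted belief.
            p_obs = sum(Z[s2][o] * predicted[s2] for s2 in states)
            if p_obs == 0.0:
                continue
            # Line 6: belief after conditioning on the observation.
            updated = {s2: Z[s2][o] * predicted[s2] / p_obs for s2 in states}
            # Lines 7-8: value of the best follow-up action.
            best = max(q_value(updated, a2, horizon - 1, model, gamma) for a2 in actions)
            v += p_obs * best
        q += gamma * v
    return q


def select_action(belief, horizon, model, gamma=0.95):
    """Return the action maximising Q(b, a, h)."""
    return max(model["actions"], key=lambda a: q_value(belief, a, horizon, model, gamma))


# Hypothetical toy domain: moving ends the task, asking to repeat keeps the intention.
STATES = ["wants_left", "wants_right", "done"]
ACTIONS = ["go_left", "go_right", "ask_repeat"]


def point_mass(state):
    return {s: 1.0 if s == state else 0.0 for s in STATES}


R = {(s, a): 0.0 for s in STATES for a in ACTIONS}
R.update({
    ("wants_left", "go_left"): 6.0, ("wants_left", "go_right"): -6.0,
    ("wants_right", "go_right"): 6.0, ("wants_right", "go_left"): -6.0,
    ("wants_left", "ask_repeat"): -1.0, ("wants_right", "ask_repeat"): -1.0,
})
MODEL = {
    "states": STATES,
    "actions": ACTIONS,
    "observations": ["heard_left", "heard_right"],
    "T": {(s, a): point_mass(s if a == "ask_repeat" else "done")
          for s in STATES for a in ACTIONS},
    "Z": {
        "wants_left": {"heard_left": 0.8, "heard_right": 0.2},
        "wants_right": {"heard_left": 0.2, "heard_right": 0.8},
        "done": {"heard_left": 0.5, "heard_right": 0.5},
    },
    "R": R,
}
uniform = {"wants_left": 0.5, "wants_right": 0.5, "done": 0.0}
best_action = select_action(uniform, 2, MODEL)
```

With a horizon of 2 and a uniform belief over the two intentions, the sketch prefers `ask_repeat`: paying a small cost to gather evidence yields a higher expected return than acting immediately.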

4. Models 

We now describe two alternative modelling approaches devel- 
oped for the transition model. 

4.1. Model 1: multinomial distributions 

The first model is constructed using standard multinomial dis- 
tributions, based on the factorisation described in §2.1. 

Both the user action model $P(a'_u \mid i'_u, a_m)$ and the user goal
model $P(i'_u \mid i_u, a_m, c)$ are defined as collections of
multinomials whose parameters are encoded with Dirichlet distributions.
It is possible to exploit prior domain knowledge about the
relative likelihood of some event by adapting the Dirichlet counts $\alpha$
to skew the distribution in a particular direction. For instance,
we can encode the fact that the user is unlikely to change his
intention after a clarification request by associating a higher $\alpha$
value to the intention $i'_u$ corresponding to the current value $i_u$
when $a_m$ is a clarification request.
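
One possible way to encode such a skewed prior is sketched below. The function `intention_prior`, the `boost` value, the action label `ClarificationRequest` and the three intentions are assumptions made for illustration; the paper does not specify the actual counts it used.

```python
def intention_prior(current_intention, intentions, system_action, boost=5.0):
    """Dirichlet counts over the next intention, skewed towards keeping the current one."""
    # Start from a uniform prior with one pseudo-count per intention.
    alpha = {intention: 1.0 for intention in intentions}
    if system_action == "ClarificationRequest":
        # Prior knowledge: the intention rarely changes after a clarification request.
        alpha[current_intention] += boost
    return alpha


intentions = ["MoveLeft", "MoveRight", "PickUp"]
alpha = intention_prior("PickUp", intentions, "ClarificationRequest")
prior_mean = {i: count / sum(alpha.values()) for i, count in alpha.items()}
```

With these illustrative counts the prior mean assigns 0.75 to the current intention, while leaving the remaining intentions possible so that the data can still revise the estimate.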

4.2. Model 2: probabilistic rules 

The second model relies on probabilistic rules to capture the 
domain structure in a compact manner and thereby reduce the 
number of parameters to estimate. We provide here a very brief 
overview of the formalism, previously presented in [24, 25].

Probabilistic rules take the form of if...then...else control
structures and map a list of conditions on input variables to
specific effects on output variables. A rule is formally
expressed as an ordered list $(c_1, \ldots, c_n)$, where each case $c_i$ is
associated with a condition $\phi_i$ and a distribution over effects
$\{(\psi_i^1, p_i^1), \ldots, (\psi_i^k, p_i^k)\}$, where $\psi_i^j$ is an effect with associated
probability $p_i^j = P(\psi_i^j \mid \phi_i)$. Note that $p_i^{1 \ldots k}$ must satisfy the
usual probability axioms. The rule reads as such:

    if ($\phi_1$) then
        $\{[P(\psi_1^1) = p_1^1], \ldots, [P(\psi_1^k) = p_1^k]\}$
    ...
    else if ($\phi_n$) then
        $\{[P(\psi_n^1) = p_n^1], \ldots, [P(\psi_n^k) = p_n^k]\}$


The conditions $\phi_i$ are arbitrarily complex logical
formulae grounded in the input variables. Associated to each
condition stands a list of alternative effects that define specific value
assignments for the output variables. Each effect is assigned 
a probability that can be either hard-coded or correspond to a 
Dirichlet parameter to estimate (as in our case). 

Here is a simple example of a probabilistic rule:

    Rule: if ($a_m = \mathrm{Confirm}(X) \land i_u \neq X$) then
        $\{[P(a'_u = \mathrm{Disconfirm}) = \theta_1]\}$

The rule specifies that, if the system requests the user to confirm
that his intention is $X$, but the actual intention is different, the
user will utter a Disconfirm action with probability $\theta_1$ (which
is presumably quite high). Otherwise, the rule produces a void
effect - i.e. it leaves the distribution $P(a'_u)$ unchanged.
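
The example rule can be mimicked by a small function that returns a distribution over effects. This is only a sketch: the tuple encoding of system actions, the `"void"` label and the choice to assign the remaining probability mass $1 - \theta_1$ to the void effect are assumptions of this example rather than details confirmed by the text above.

```python
def confirm_rule(system_action, user_intention, theta_1):
    """Return a distribution over effects on the next user act."""
    act_type, argument = system_action
    if act_type == "Confirm" and argument != user_intention:
        # Condition satisfied: the user disconfirms with probability theta_1.
        # Assumption: the remaining mass goes to the void effect.
        return {"Disconfirm": theta_1, "void": 1.0 - theta_1}
    # Condition not satisfied: the rule leaves the distribution unchanged.
    return {"void": 1.0}


mismatch = confirm_rule(("Confirm", "PickUp"), "MoveLeft", theta_1=0.9)
match = confirm_rule(("Confirm", "PickUp"), "PickUp", theta_1=0.9)
```

A single parameter `theta_1` covers every pair of mismatching intentions, which is the source of the parameter reduction that the rule-based model relies on.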

At runtime, the rules are instantiated as additional nodes in 
the Bayesian Network encoding the belief state. They therefore 
function as high-level templates for a plain probabilistic model. 
We once more refer the reader to [24, 25] for details.




Figure 2: User interacting with the Nao robot.



5. Evaluation 

We evaluated our approach within a human-robot interaction 
scenario. We started by gathering empirical data for our dia- 
logue domain using Wizard-of-Oz experiments, after which we 
built a user simulator on the basis of the collected data. The 
learning performance of the two models was finally evaluated 
on the basis of this user simulator. 

5.1. Wizard-of-Oz data collection 

The dialogue domain involved a Nao robot conversing with a
human user in a shared visual scene including a few graspable
objects, as illustrated in Fig. 2. The users were instructed to
command the robot to walk in different directions and carry the
objects from one place to another. The robot could also answer
questions about its current perception (e.g. "do you see a blue
cylinder?"). In total, the domain included 11 distinct user intentions,
and the user inputs were classified into 16 dialogue acts. With
the contextual variables, the domain included 2112 possible
states. The robot could execute 37 possible actions, including
both physical and conversational actions.

Eight interactions were recorded, each with a different speaker,
totalling about 50 minutes. The interactions were performed in 
English. After the recording, the dialogues were manually seg- 
mented and annotated with dialogue acts, system actions, user 
intentions, and contextual variables (e.g. perceived objects). 

5.2. User simulator 

Based on the annotated dialogues, we used maximum likelihood estimation (MLE) to derive the
user goal and action models, as well as a contextual model for 
the robot's perception. To reproduce imperfect speech recog- 
nition, we applied a speech recogniser (Nuance Vocon) to the 
Wizard-of-Oz user utterances and processed the recognition re- 
sults to derive a Dirichlet distribution with three dimensions re- 
spectively standing for the probability of the correct utterance, 
the probability of incorrect recognition, and the probability of 
no recognition. The N-best lists were generated by the
simulator with probabilities drawn from this distribution, estimated as
Dirichlet(5.4, 0.52, 1.6) with Minka's method [26].
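
As a rough illustration of how recognition probabilities could be drawn from such a distribution, the sketch below samples from a Dirichlet by normalising independent Gamma draws, using only the Python standard library. The function name `sample_dirichlet` and the fixed seed are assumptions; the construction of the N-best lists in the actual simulator is not described in enough detail to be reproduced here.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
alpha = (5.4, 0.52, 1.6)  # correct, incorrect, no recognition
p_correct, p_incorrect, p_none = sample_dirichlet(alpha, rng)

# Mean of the distribution: each count divided by the sum of the counts.
expected = [a / sum(alpha) for a in alpha]
```

Under these counts the correct utterance receives a probability of about 0.72 on average, with most of the remainder going to the no-recognition outcome.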

5.3. Experimental setup 

The simulator was coupled to the dialogue system to compare 
the learning performance of the two models. The multino- 
mial model contained 228 Dirichlet parameters. The rule-based 
model contained 6 rules with 14 corresponding Dirichlet pa- 
rameters. Weakly informative priors were used for the initial 
parameter distributions in both models. The reward model, in
Table 1, was identical in both cases. The planner operated with
a horizon of length 2 and included an observation model
introducing random noise to the user dialogue acts.



System action                            Reward

Execution of correct action              +6
Execution of wrong action                -6
Answer to correct question               +6
Answer to wrong question                 -6
Grounding correct intention              +2
Grounding wrong intention                -6
Ask to confirm correct intention         -0.5
Ask to confirm wrong intention           -1.5
Ask to repeat                            -1
Ignore user act                          -1.5

Table 1: Reward model designed for the domain.

The performance was first measured in terms of average
return per episode, shown in Fig. 3. To analyse the accuracy
of the transition model, we also derived the Kullback-Leibler
divergence [27] between the next user act distribution $P(a'_u)$
predicted by the learned model and the actual distribution
followed by the simulator at a given time¹ (Fig. 4). The results of
both figures are averaged over 100 simulations.
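
For reference, the Kullback-Leibler divergence between two discrete distributions can be computed as in the sketch below. The direction of the divergence (actual distribution first, learned distribution second), the `epsilon` floor used to avoid division by zero, and the two example distributions are assumptions of this example; the text does not state which direction was used.

```python
import math


def kl_divergence(p, q, epsilon=1e-12):
    """D_KL(p || q) in nats for two distributions given as dicts over the same values."""
    return sum(
        p_x * math.log(p_x / max(q.get(x, 0.0), epsilon))
        for x, p_x in p.items()
        if p_x > 0.0
    )


# Hypothetical distributions over the next user act.
actual = {"Confirm": 0.5, "Disconfirm": 0.5}
learned = {"Confirm": 0.9, "Disconfirm": 0.1}
divergence = kl_divergence(actual, learned)
```

The divergence is zero when the two distributions coincide and grows as the learned model drifts away from the simulator.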




[Plot omitted. X-axis: Episodes. Series: Multinomial distributions; Probabilistic rules.]

Figure 3: Average return per episode.




[Plot omitted. X-axis: Number of turns. Series: Multinomial distributions; Probabilistic rules.]

Figure 4: K-L divergence between the estimated distribution
$P(a'_u)$ and the actual distribution followed by the simulator.



5.4. Analysis of results 

The empirical results illustrate that both models are able to 
capture at least some of the interaction dynamics and achieve 
higher returns as the number of turns increases, but they do so at 
different learning rates. In our view, this difference is to be ex- 
plained by the higher generalisation capacity of the probabilistic 
rules compared to the unstructured multinomial distributions. 

It is interesting to note that most of the Dirichlet param- 
eters associated with the probabilistic rules converge to their 
optimal value very rapidly, after a handful of episodes. This is 
a promising result, since it implies that the proposed approach 
could in principle optimise dialogue policies from live
interactions, without the need to rely on a user simulator, as in [3].



6. Related work 

The first studies on model-based reinforcement learning for
dialogue management have concentrated on learning from a fixed
corpus via Dynamic Programming methods [28, 29, 30]. The
literature also contains some recent work on Bayesian
techniques. [31] presents an interesting approach that combines
Bayesian inference with active learning. [32] is another related
work that utilises a sample of solved POMDP models. Both
employ offline solution techniques. To our knowledge, the only
approaches based on online planning are [15, 33], although they
focussed on the estimation of the observation model.

It is worth noting that most POMDP approaches do
integrate statistically estimated transition models in their belief
update mechanism, but they typically do not exploit this
information to optimise the dialogue policy, preferring to employ
model-free methods for this purpose [8, 34].

Interesting parallels can be drawn between the structured 
modelling approach adopted in this paper (via the use of
probabilistic rules) and related approaches dedicated to dimensionality
reduction in large state-action spaces, such as function
approximation [6], hierarchical RL [5], summary POMDPs [8], state
space partitioning [35, 36] or relational abstractions [37]. These
approaches are however typically engineered towards a
particular type of domain (often slot-filling applications). There has
also been some work on the integration of expert knowledge
using finite-state policies or ad-hoc constraints [38, 6]. In these
approaches, the expert knowledge operates as an external filter- 
ing mechanism, while the probabilistic rules aim to incorporate 
this knowledge into the structure of the statistical model. 

7. Conclusion 

We have presented a model-based Bayesian reinforcement 
learning approach to the estimation of transition models for di- 
alogue management. The method relies on an explicit repre- 
sentation of the model uncertainty via a posterior distribution 
over the model parameters. Starting with an initial Dirichlet 
prior, this distribution is continuously refined through Bayesian 
inference as more data is collected by the learning agent. An 
approximate online planning algorithm selects the next action 
to execute given the current belief state and the posterior distri- 
bution over the model parameters. 

We evaluated the approach with two alternative models, one 
using multinomial distributions and one based on probabilistic 
rules. We conducted a learning experiment with a user simu- 
lator bootstrapped from Wizard-of-Oz data, which shows that 
both models improve their estimate of the domain's transition 
model during the interaction. These improved estimates are also 
reflected in the system's action selection, which gradually yields 
higher returns as more episodes are completed. The probabilis- 
tic rules do however converge much faster than multinomial dis- 
tributions, due to their ability to capture the domain structure in 
a limited number of parameters. 

In our future work, we would like to directly compare our 
approach with model-free RL methods such as Monte Carlo
estimation or SARSA(λ). We also want to extend the framework
to estimate the reward model in parallel to the state transitions. 
And most importantly, we plan to conduct experiments with real 
users to verify that the outlined approach is capable of learning 
dialogue policies from direct interactions. 



¹ Some residual discrepancy is to be expected between these two
distributions, the latter being based on the actual user intention while the
former must infer it from the current belief state.